A MODIFIED DYNAMIC PROGRAMMING METHOD FOR MARKOVIAN DECISION PROBLEMS

J.  MacQueen 

University  of  California,  Los  Angeles 

1. Introduction. Let $X_1, X_2, \ldots$ be a sequence of random variables taking
values in a finite set $S$, and controlled by a decision maker who at each
time $t = 1, 2, \ldots$ observes $X_t$ and then picks an action $a$ belonging to a
finite set $A$; then if $X_t = x$, the probability that $X_{t+1} = y$ becomes
$p(y; x, a)$, where $p$ is a known function. Also, choice of action $a$ when
$X_t = x$ earns a known amount $g(x, a)$ immediately. Future income is
discounted by a constant factor $\alpha < 1$. Thus, if $a_t$ is the action chosen
after observing $X_t$, $t = 1, 2, \ldots$, the discounted return is defined to be
$g(X_1, a_1) + \alpha g(X_2, a_2) + \alpha^2 g(X_3, a_3) + \cdots$. A policy $r$ is a rule for
determining each of the actions $a_t$ as a function of $X_t$ and (possibly)
the sequences $X_1, X_2, \ldots, X_{t-1}$ and $a_1, a_2, \ldots, a_{t-1}$. If the policy $r$ is
used and $X_1 = x$, the expected discounted return is given by $u_r(x)$,
say, and we are interested in maximizing $u_r(x)$ by an appropriate choice
of $r$. Let $u^*(x) = \sup_r u_r(x)$.
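
As a small illustration of the discounted return defined above, the following Python sketch sums a finite prefix of the series and bounds the part that is cut off. The function names `discounted_return` and `tail_bound` and the sample income sequence are hypothetical, introduced only for this example.

```python
def discounted_return(rewards, alpha):
    """Finite prefix of g(X_1,a_1) + alpha*g(X_2,a_2) + alpha^2*g(X_3,a_3) + ..."""
    total, weight = 0.0, 1.0
    for g in rewards:
        total += weight * g
        weight *= alpha
    return total


def tail_bound(n_terms, alpha, g_max):
    """Largest possible absolute contribution of all terms after the first
    n_terms, assuming |g| <= g_max: alpha^n_terms * g_max / (1 - alpha)."""
    return alpha ** n_terms * g_max / (1.0 - alpha)


# Hypothetical incomes earned at t = 1, 2, 3 with discount factor 0.5.
sample = discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25
remainder = tail_bound(3, 0.5, 1.0)                # 0.125 / 0.5
```

Because the tail shrinks geometrically, a finite horizon already pins the infinite sum down to within `remainder`.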

This  paper  describes  a  simple  algorithm  for  this  problem  that  is 
basically  an  improved  version  of  the  standard  dynamic  programming  iterative 
scheme  (see  below).  Upper  and  lower  bounds  on  the  optimal  return  are 
produced by the algorithm at each iteration. These both converge
monotonically to the optimal return. Also, the policy determined at each stage
achieves  a  return  at  least  as  good  as  the  corresponding  lower  bound.  The 
sequence  of  policies  produced  is  actually  the  same  sequence  produced  by 
the  dynamic  programming  method;  the  improvement  consists  of  both  better 
information about convergence of the sequence of policies, and the fact
that, as regards computing $u^*$, the algorithm is apparently much faster.

Thus, when the algorithm was applied to the automobile replacement
problem described by Howard [5, p. 89], the upper and lower bounds were
within 1.3% of $u^*$ after 25 iterations, at which time the optimal policy
was reached. The mean of the upper and lower bounds was within .08% of
$u^*$ at this point. After 50 iterations the upper and lower bounds were
within .05% of $u^*$ and their mean was within .0005% of $u^*$. The estimate
of $u^*$ produced by the standard dynamic programming method was 40.5%
below $u^*$ after 25 iterations; in fact, after 160 iterations, this estimate
was still below $u^*$ by 1.1%. Both methods require essentially the same
computations.²

The  method  of  policy  iteration  required  only  9  iterations  for  the 
automobile  replacement  problem.  However,  while  otherwise  comparable,  each 
iteration  using  this  method  involves  the  "value  determination"  operation, 
which amounts to solving N equations in N unknowns, N being the number of
states.  Because  of  this,  it  is  not  clear  which  method  is  superior  from  a 
computational  point  of  view.  The  proposed  method  may  have  an  important 
relative  advantage  in  problems  with  a  large  number  of  states,  where  the 
value  determination  operation  presents  computational  difficulties. 
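
To make the cost of the "value determination" operation concrete, here is a minimal Python sketch that solves the N linear equations $u = g_r + \alpha P_r u$ for a fixed stationary policy $r$ by Gaussian elimination. The helper names, the data layout `p[x][a][y]` and `g[x][a]`, and the two-state example are assumptions made for illustration; they are not taken from Howard [5] or from this paper.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]        # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def value_determination(p, g, r, alpha):
    """Return u_r with u_r(x) = g(x, r(x)) + alpha * sum_y u_r(y) p(y; x, r(x)).

    p[x][a][y] is the transition probability, g[x][a] the immediate income,
    and r[x] the action the stationary policy picks in state x.
    """
    n = len(g)
    A = [[(1.0 if x == y else 0.0) - alpha * p[x][r[x]][y] for y in range(n)]
         for x in range(n)]
    b = [g[x][r[x]] for x in range(n)]
    return solve_linear(A, b)


# Hypothetical two-state, two-action problem (not from the paper).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
G = [[1.0, 0.0],
     [2.0, 3.0]]
u_r = value_determination(P, G, r=[1, 1], alpha=0.9)
```

The elimination step costs on the order of N cubed operations per policy, which is the expense the paragraph above refers to when the number of states is large.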

The main properties of the algorithm are described in Theorem 2 of
Section 3.³ A key part of this theorem is based on the very simple but
useful relationship contained in Theorem 1 of Section 2. Theorem 1 may be
of independent interest. The error bounds provided by parts (i) and (iv)
of Theorem 2 can be applied to the policies and estimates of the optimal
return produced by other methods.

² In this comparison, the initial function used by both methods was set
at zero, and the percentage errors given are based on the state where this
error was maximal using the proposed method.

³ Theorem 1 derives from some joint work [6] of R. M. Redheffer and
the author.

For  further  relevant  discussion  of  Markovian  decision  problems,  the 
reader  is  referred  to  papers  by  d'Epenoux  [3],  Mann  [7],  Scarf  [8],  and 
Wagner  [9]. 

2. Notation and preliminaries. For dealing with a sequence of real-valued
functions on $S$, $v_1, v_2, \ldots$, it is convenient to associate with
each $v_n$ another function $r_n$ on $S$ into $A$, such that

$$g(x, r_n(x)) + \alpha \sum_y v_n(y)\, p(y; x, r_n(x)) = \max_a \Big[ g(x, a) + \alpha \sum_y v_n(y)\, p(y; x, a) \Big],$$

and then define the function $g_n$ by $g_n(x) = g(x, r_n(x))$ and the
transformation $P_n$ by $(P_n f)(x) = \sum_y f(y)\, p(y; x, r_n(x))$. In these terms the
dynamic programming algorithm is defined by an initial function $v_1$
and the rule $v_{n+1} = g_n + \alpha P_n v_n$, $n = 1, 2, \ldots$. A function $r$ on $S$
into $A$ is termed a stationary policy. For such a function, define
the transformation $T_r$ by

$$(T_r f)(x) = f(x) - g(x, r(x)) - \alpha \sum_y f(y)\, p(y; x, r(x)).$$

The expected return $u_r$ for a stationary policy satisfies the equation
$T_r u = 0$.

Now define the transformation $T^*$ by

$$(T^* f)(x) = f(x) - \max_a \Big[ g(x, a) + \alpha \sum_y f(y)\, p(y; x, a) \Big].$$

Thus $T^* v_n = v_n - (g_n + \alpha P_n v_n)$.

Using the principle of optimality [2], we can easily convince
ourselves⁴ that $u^*$ satisfies the equation $T^* u = 0$.
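
The definitions of this section can be sketched in a few lines of Python. The code below is an illustration under assumed conventions, not the author's program: `greedy_backup` returns $g_n + \alpha P_n v_n$ together with a maximizing $r_n$, `t_star` applies $T^*$, and `dynamic_programming` runs the standard rule $v_{n+1} = g_n + \alpha P_n v_n$ from $v_1 = 0$. The function names, the array layout, and the two-state example data are all hypothetical.

```python
def greedy_backup(v, p, g, alpha):
    """Return (g_n + alpha P_n v, r_n) for the function v.

    p[x][a][y] is the transition probability and g[x][a] the immediate income.
    Ties between actions are broken in favor of the lowest action index.
    """
    n = len(g)
    backed, policy = [], []
    for x in range(n):
        q = [g[x][a] + alpha * sum(p[x][a][y] * v[y] for y in range(n))
             for a in range(len(g[x]))]
        best = max(range(len(q)), key=lambda a: q[a])
        policy.append(best)
        backed.append(q[best])
    return backed, policy


def t_star(f, p, g, alpha):
    """(T* f)(x) = f(x) - max_a [ g(x,a) + alpha * sum_y f(y) p(y; x, a) ]."""
    backed, _ = greedy_backup(f, p, g, alpha)
    return [f[x] - backed[x] for x in range(len(f))]


def dynamic_programming(p, g, alpha, iterations):
    """Standard scheme: v_1 = 0, then v_{n+1} = g_n + alpha P_n v_n."""
    v = [0.0] * len(g)
    policy = None
    for _ in range(iterations):
        v, policy = greedy_backup(v, p, g, alpha)
    return v, policy


# Hypothetical two-state, two-action problem (not from the paper).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
G = [[1.0, 0.0],
     [2.0, 3.0]]
v, policy = dynamic_programming(P, G, alpha=0.9, iterations=400)
residual = max(abs(e) for e in t_star(v, P, G, 0.9))
```

After enough iterations `residual` is close to zero, which is the numerical counterpart of the equation $T^* u = 0$.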

Theorem 1. $T^* u \le T^* v$ implies $u \le v$.

⁴ For rigorous treatment of this and related questions see [1] and

!>]. 




Proof. Translated, the hypothesis $T^* u \le T^* v$ becomes

$$\begin{aligned}
u(x) - v(x) &\le \max_a \Big[ g(x,a) + \alpha \sum_y u(y)\, p(y;x,a) \Big] - \max_a \Big[ g(x,a) + \alpha \sum_y v(y)\, p(y;x,a) \Big] \\
&\le \max_a\, \alpha \sum_y \big( u(y) - v(y) \big)\, p(y;x,a).
\end{aligned}$$

Suppose the maximum of the left side is $m > 0$. The maximum will be
achieved at a point $x_0$. Replacing $u - v$ with $m$ on the right we get
the contradiction

$$u(x_0) - v(x_0) = m \le \max_a\, \alpha \sum_y m\, p(y; x_0, a) = \alpha m,$$

and the proof is complete.

If there is only one action for each state, $T^*$ is of the same form
as $T_r$. Thus, we have

Corollary 1. $T_r u \le T_r v$ implies $u \le v$.

An immediate application of Theorem 1 is

Corollary 2. The dynamic programming equation $T^* u = 0$ has at most one
(finite) solution.

Proof. If $T^* u = T^* v = 0$, then $u \le v$ and $v \le u$ by Theorem 1. Hence, $u = v$.

3. The algorithm. Let $v_1$ be an arbitrary function with $v_1(s) = 0$,
where $s$ is a conveniently selected state, and define the sequence of
functions $\{v_n\}$ and the sequences of constants $\{L'_n\}$ and $\{L''_n\}$ by

$$\begin{aligned}
v_{n+1} &= g_n + \alpha P_n v_n - (g_n + \alpha P_n v_n)(s), \\
L'_n &= \min_x\, (g_n + \alpha P_n v_n - v_n)(x), \\
L''_n &= \max_x\, (g_n + \alpha P_n v_n - v_n)(x).
\end{aligned}$$

Notice each function $v_n$ is zero at $s$. Now let $t = (1 - \alpha)^{-1}$, and
define the sequences of functions $\{u'_n\}$ and $\{u''_n\}$ by

$$u'_n = v_n + t L'_n, \qquad u''_n = v_n + t L''_n.$$

Theorem 2. (i) The optimal return $u^*$ satisfies $u'_n \le u^* \le u''_n$.
(ii) $u'_n \le u'_{n+1}$, $u''_n \ge u''_{n+1}$. (iii) $u'_n \to u^*$, $u''_n \to u^*$. (iv) Let
$u^*_n$ be the expected discounted return for the stationary policy $r_n$.
Then $u^*_n \ge u'_n$.

Proof. In the following, let $v'_n = g_n + \alpha P_n v_n - L'_n$, so that
$v'_n \ge v_n$, and let $v''_n = g_n + \alpha P_n v_n - L''_n$, so that $v''_n \le v_n$.
Also, $v_{n+1} = v'_n - v'_n(s) = v''_n - v''_n(s)$.

(i) $u'_n \le u^* \le u''_n$. As was pointed out above, $u^*$ satisfies
$T^* u^* = 0$. From the definition of $T^*$ we get

$$\begin{aligned}
T^* u'_n &= v_n + t L'_n - [\, g_n + \alpha P_n v_n + \alpha t L'_n \,] \\
&= v_n + t L'_n - [\, v'_n + L'_n + \alpha t L'_n \,] \\
&= v_n - v'_n \le 0 = T^* u^*,
\end{aligned}$$

where the last equality uses $t - 1 - \alpha t = 0$.
Therefore $u'_n \le u^*$ by Theorem 1. Similarly,

$$\begin{aligned}
T^* u''_n &= v_n + t L''_n - [\, g_n + \alpha P_n v_n + \alpha t L''_n \,] \\
&= v_n + t L''_n - [\, v''_n + L''_n + \alpha t L''_n \,] \\
&= v_n - v''_n \ge 0 = T^* u^*,
\end{aligned}$$

and $u''_n \ge u^*$ again by Theorem 1.
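
The two sign conditions used in part (i) hold for any function $v_n$, which makes them easy to check numerically. The Python sketch below does this for a few trial functions; the helper names, the trial functions, and the two-state example are assumptions chosen for illustration only.

```python
def greedy_backup(v, p, g, alpha):
    """Return the function x -> max_a [ g(x,a) + alpha * sum_y v(y) p(y; x, a) ]."""
    n = len(g)
    return [max(g[x][a] + alpha * sum(p[x][a][y] * v[y] for y in range(n))
                for a in range(len(g[x])))
            for x in range(n)]


def t_star(f, p, g, alpha):
    """(T* f)(x) = f(x) - max_a [ g(x,a) + alpha * sum_y f(y) p(y; x, a) ]."""
    backed = greedy_backup(f, p, g, alpha)
    return [f[x] - backed[x] for x in range(len(f))]


def bounds_from(v, p, g, alpha):
    """Return (u', u'') = (v + t L', v + t L'') built from the function v."""
    t = 1.0 / (1.0 - alpha)
    backed = greedy_backup(v, p, g, alpha)
    diff = [backed[x] - v[x] for x in range(len(v))]
    lower = [vx + t * min(diff) for vx in v]
    upper = [vx + t * max(diff) for vx in v]
    return lower, upper


# Hypothetical two-state, two-action problem (not from the paper).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
G = [[1.0, 0.0],
     [2.0, 3.0]]

checks = []
for v in ([0.0, 0.0], [0.0, 5.0], [0.0, -3.0]):   # arbitrary trial functions
    lower, upper = bounds_from(v, P, G, 0.9)
    # T* u' should be <= 0 everywhere and T* u'' should be >= 0 everywhere.
    checks.append((max(t_star(lower, P, G, 0.9)), min(t_star(upper, P, G, 0.9))))
```

Up to rounding, the first entry of each pair in `checks` is nonpositive and the second is nonnegative, matching $T^* u'_n \le 0 \le T^* u''_n$.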

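(ii) $u'_n \le u'_{n+1}$, $u''_n \ge u''_{n+1}$. For convenience we use 1 and 2 in
place of $n$ and $n+1$. We have

$$\begin{aligned}
u'_2 = v_2 + t L'_2 &= v_2 + t \min_x\, [\, g_2 + \alpha P_2 v_2 - v_2 \,] \\
&\ge v_2 + t \min_x\, [\, g_1 + \alpha P_1 v_2 - v_2 \,] \\
&= v_2 + t \min_x\, [\, g_1 + \alpha P_1 v'_1 - \alpha v'_1(s) - v'_1 + v'_1(s) \,] \\
&\ge v_2 + t \min_x\, [\, g_1 + \alpha P_1 v_1 - v'_1 + (1-\alpha)\, v'_1(s) \,] \\
&= v_2 + t L'_1 + v'_1(s) = v'_1 + t L'_1 \ge v_1 + t L'_1 = u'_1.
\end{aligned}$$

Similarly,

$$\begin{aligned}
u''_2 = v_2 + t L''_2 &= v_2 + t \max_x\, [\, g_2 + \alpha P_2 v_2 - v_2 \,] \\
&= v_2 + t \max_x\, [\, g_2 + \alpha P_2 v''_1 - \alpha v''_1(s) - v''_1 + v''_1(s) \,] \\
&\le v_2 + t \max_x\, [\, g_2 + \alpha P_2 v_1 - v''_1 + (1-\alpha)\, v''_1(s) \,] \\
&\le v_2 + t \max_x\, [\, g_1 + \alpha P_1 v_1 - v''_1 + (1-\alpha)\, v''_1(s) \,] \\
&= v''_1 + t L''_1 \le v_1 + t L''_1 = u''_1.
\end{aligned}$$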

(iii) Convergence of $u'_n$ and $u''_n$ to $u^*$. Convergence itself is
immediate from the monotonicity and the fact that $u^*$ is an upper
bound for $u'_n$ and a lower bound for $u''_n$. Let $u_\infty = \lim u'_n$. We show
that $u_\infty$ satisfies $T^* u = 0$, and hence $u_\infty = u^*$ by Corollary 2. The
argument is similar for $u''_n$. Since $L'_n = u'_n(s)/t \le u^*(s)/t$,
$\lim L'_n = L_\infty$ is finite. Let $\lim v_n = v_\infty = u_\infty - t L_\infty$.

First we establish that $v'_n(s) \to 0$; in fact $\sum_n v'_n(s)$ converges.
Considering the proof of (ii) at the point $x = s$ yields
$L'_2 \ge L'_1 + (1-\alpha)\, v'_1(s)$. Proceeding inductively gives
$L'_{n+1} \ge L'_1 + (1-\alpha) \sum_{k=1}^{n} v'_k(s)$.
Since $L'_n$ is bounded and since $v'_n(s) \ge 0$, $\sum_n v'_n(s)$ converges. Now,
$v'_n = v_{n+1} + v'_n(s) = g_n + \alpha P_n v_n - L'_n$, so we write

$$\begin{aligned}
v_{n+1}(x) + v'_n(s) &= \max_a \Big[ g(x,a) + \alpha \sum_y v_\infty(y)\, p(y;x,a) + \alpha \sum_y \big( v_n(y) - v_\infty(y) \big)\, p(y;x,a) \Big] - L'_n \\
&\le \max_a \Big[ g(x,a) + \alpha \sum_y v_\infty(y)\, p(y;x,a) \Big] + \max_x \max_a \Big[ \alpha \sum_y \big( v_n(y) - v_\infty(y) \big)\, p(y;x,a) \Big] - L'_n.
\end{aligned}$$

Taking limits gives

$$v_\infty(x) \le \max_a \Big[ g(x,a) + \alpha \sum_y v_\infty(y)\, p(y;x,a) \Big] - L_\infty.$$

With min min replacing max max in the preceding, the inequality
is reversed so that we get equality. Substitution of $v_\infty = u_\infty - t L_\infty$
gives $u_\infty(x) = \max_a \big[ g(x,a) + \alpha \sum_y u_\infty(y)\, p(y;x,a) \big]$, that is, $T^* u_\infty = 0$.

(iv) $u^*_n \ge u'_n$. Define $T_{r_n}$ as indicated in Section 2, by
$T_{r_n} f = f - (g_n + \alpha P_n f)$. Now, $u^*_n = g_n + \alpha P_n u^*_n$, that is,
$T_{r_n} u^*_n = 0$. But $T_{r_n} u'_n \le 0$ as was seen in the proof of (i).
Application of Corollary 1 gives $u'_n \le u^*_n$. This completes the proof.
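
Part (iv) can also be checked numerically. The Python sketch below runs a few iterations of the algorithm and, at each one, compares the lower bound $u'_n$ with the return of the stationary policy $r_n$. As an assumption made to keep the example short, that return is approximated by repeated substitution into $u = g_n + \alpha P_n u$ rather than by solving the linear equations exactly; all names and the two-state data are hypothetical.

```python
def greedy_backup(v, p, g, alpha):
    """Return (g_n + alpha P_n v, r_n) for the function v."""
    n = len(g)
    backed, policy = [], []
    for x in range(n):
        q = [g[x][a] + alpha * sum(p[x][a][y] * v[y] for y in range(n))
             for a in range(len(g[x]))]
        best = max(range(len(q)), key=lambda a: q[a])
        policy.append(best)
        backed.append(q[best])
    return backed, policy


def policy_return(p, g, r, alpha, sweeps=2000):
    """Approximate the return of stationary policy r by iterating
    u <- g_r + alpha P_r u; the error shrinks like alpha**sweeps."""
    n = len(g)
    u = [0.0] * n
    for _ in range(sweeps):
        u = [g[x][r[x]] + alpha * sum(p[x][r[x]][y] * u[y] for y in range(n))
             for x in range(n)]
    return u


# Hypothetical two-state, two-action problem (not from the paper).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
G = [[1.0, 0.0],
     [2.0, 3.0]]
alpha, s = 0.9, 0
t = 1.0 / (1.0 - alpha)

v = [0.0, 0.0]                       # v_1, zero at s
slack = []                           # min_x (u*_n - u'_n)(x) for each n
for _ in range(10):
    backed, policy = greedy_backup(v, P, G, alpha)
    L_lo = min(backed[x] - v[x] for x in range(2))       # L'_n
    lower = [v[x] + t * L_lo for x in range(2)]          # u'_n
    achieved = policy_return(P, G, policy, alpha)        # approximates u*_n
    slack.append(min(a - l for a, l in zip(achieved, lower)))
    v = [backed[x] - backed[s] for x in range(2)]        # v_{n+1}
```

Every entry of `slack` is nonnegative up to rounding, so stopping at iteration $n$ and using $r_n$ earns at least the reported lower bound.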